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1 Introduction 


The numerical solution of large sparse eigenvalue problems arises in numerous important 
scientific applications that can be termed supercomputing applications. The advances in 
supercomputing technology allow today to tackle very large eigenvalue problems that were 
not feasible a few years ago. Moreover, this trend is likely to continue as the scientific models 
will become more sophisticated and as the numerical methods will improve in efficiency and 
reliability. This paper considers the problems encountered when adapting two well-known 
numerical techniques for solving eigenvalue problems, to supercomputing environments. The 
two methods considered are the Lanczos algorithm and Davidson’s method. The first is a 
well-known technique which has had impressive success in the last two decades, for solving 
standard symmetric eigenvalue problems. It main attraction is that it can build, at least in 
theory, an orthogonal basis of the so-called Krylov subspace K m = span{v, Av , ..., A m-1 v} 
with the help of a simple three-term recurrence. To extract the Ritz eigenvalues from this 
subspace, one only needs to compute eigenvalues of a tridiagonal matrix whose coefficients 
are obtained from the three-term recurrence. In Davidson’s method one constructs explicitly 
an orthogonal basis of a certain subspace which is obtained by adding a vector of the form 
(M — 9I)~ 1 (A — 9I)v where M is a preconditioner of A, 0 is a shift close to the eigenvalue 
being computed and v is the previous basis vector. This can be viewed as a preconditioned 
version of the Lanczos algorithm. The advantage is faster convergence, i.e., the subspace 
for which convergence is achieved in Davidson’s method is of much smaller dimension than 
with Lanczos. On the other hand each iteration now costs much more since the three 
term recurrence is lost and one must explicitly orthogonalize each new vector against all 
previous ones. However, for many problems arising in specific applications such as chemistry, 
Davidson’s method is still superior. 

Though very similar in theory, the two methods are different from the computational 
point of view and the problems encountered when implementing them on supercomputers 
are not the same. Their only common point is that they both use at each step a matrix 
by vector multiplication. There are issues, such as reorthogonalization for the Lanczos 
algorithm, that are proper to one of the methods only. 

In the next section we describe the two methods and compare their advantages and 
disadvantages. Then we will discuss the problem of performing a matrix by vector product 
on supercomputers. In Section 4, we will describe implementation details and numerical 
experiments for the CRAY 2 and for the CRAY X-MP. Finally in Section 5, we will give 
some ideas on how to adapt the two methods on a parallel machine with a moderate number 
of processors. 


2 The Lanczos and Davidson methods 

2.1 The Lanczos Algorithm 

The Lanczos algorithm in its basic form, is an algorithm for computing an orthogonal basis 
of a Krylov subspace, i.e., a subspace of the form K m = span{v\, Avi,..., A m ~ l vi). The 
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algorithm is as follows. 

Start: Choose an initial vector vj. 

Iterate: for j = 1, 2, .., m, .. Do 

• y := 

• If j > 1 compute y y- 0jVj-i 

• ■= (y, vj) 

• y-=y- QjVj 

• 0j + 1 := llyfla 

• wj+i *•= 

The main iteration in the above algorithm can be succinctly described by the following 
three term recurrence: 

0j+i v j+i = Avj - ctjVj - pjVj-x (1) 

where atj and /3j + i are selected so that the vector Vj+ 1 is orthogonal to both Vj and Vj-i. 
It can be shown that this ensures that the sequence of vectors u< forms an orthonormal 
sequence. 

If we denote by V m the N x m matrix 


V m = [ui,U2,...,U m ] 

and by T m the tridiagonal matrix: 

foi 02 \ 

I 02 <*2 03 I 


' ftm &m f 

then we have the relations, 

A.V , n — V m T m + ^TO+lV m +ie m 


( 2 ) 


and 

vlAv m = r m 

Therefore, it is natural to approximate the eigenvalues of A by some of the eigenvalues of T m . 
This will constitute a Rayleigh-Ritz [21] procedure on the Krylov subspace K m . It is known 
[21] that the outmost eigenvalues of T m , will rapidly become good approximations to those 
of A. In this paper we will assume that we Me interested in the smallest eigenvalues so we 
number all eigenvalues, exact and approximate, increasingly and denote by the i — th 
eigenvalue of T m , and j/j m * the corresponding eigenvector. The corresponding approximate 
eigenvector of A is given by 


3 



Although the above procedure seems very simple to implement there are several pitfalls. 
One of the first observations made on the Lanczos algorithm is the loss of orthogonality 
of the vectors u,-. We can only summarize the main findings concerning the analysis on 
loss of orthogonality but we refer the readers to [21] for details. Loss of orthogonality is 
triggered by convergence of some eigenvector and is avoidable only at the price of some form 
of reorthogonalization. The simplest reorthogonalization possible is full reorthogonalization: 
at every step one reorthogonalizes the current Lanczos vector against all previous ones. 
However, it was discovered that loss of orthogonality does not prevent convergence, it can at 
worst slow it down [19]. As a result several authors suggested using the algorithm without 
reorthogonalization. The implementation of the algorithm without reorthogonalization is 
complicated by several factors. First, the approximate eigenvalues may now appear several 
times. Moreover, one must be careful about monitoring convergence of eigenvalues: some 
eigenvalues will seem to converge and then disappear for a while only to reappear later. This 
does not happen with full reorthogonalization. The advantages of the algorithm without 
reorthogonalization are clear, the avoidance of storing all the Lanczos vectors being perhaps 
the most important one. ^From the point of view of cost, orthogonalization by Gram Schmidt 
is rather expensive. 

For additional details on the Lanczos method, see [4,13,23,25,26]. 

2.2 The Davidson Algorithm 

Similarly to the Lanczos algorithm, the Davidson algorithm [6] is based on the projection of 
the matrix over a sequence of subspaces of increasing dimension. In some sense, the method 
can be considered as a preconditioned version of the Lanczos method [17] although the 
context is rather different from that of preconditioned techniques for solving linear systems. 
If the preconditioner is efficient, the convergence can be very fast. The major drawback 
of the method comes from the amount of work involved in one iteration, which increases 
with the dimension of the subspace : in contrast with the Lanczos algorithm, the restricted 
matrices are full and they must explicitly be computed; moreover, it is now necessary to 
store the orthonormal basis and its resulting transformation by the original matrix. Hence, 
the process must be restarted periodically with the current best estimate of the wanted 
eigenvector. The algorithm is built around two embedded loops. To compute the largest 
(resp. smallest) eigenvalue of A, the process can be described as follows. 
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Start: Choose an initial unit vector Vj. 

Iterate: for iter = 1,2, ... Do 

Iterate: for j = 1, ..., k Do 

• Wj := AVj (only the last column is computed) 

• Hj := VjWj (only the last column is computed) 

• computation of the largest eigenpair of H ) : (A, y) 

• x := Vjy (Ritz- vector) 

• r := Wjy — Ax (residual) 

• test for convergence 

• If j < k : 

t := ( M — A/) -1 r (M is the preconditioner) 

Vf+i := MGS{\Vj, <]) (Modified-Gram-Schmidt proce- 
dure) 

V\ := x 


The simplest and most common preconditioner M is the main diagonal of A (Jacobi 
Preconditioner). It can only be used when the matrix A is nearly diagonal in the sense that 
its matrix of eigenvectors is close to the identity ; this is often the situation in Quantum 
Chemistry and this is the reason for the method is popular there. 

It should be noticed that, if no preconditioner were to be used (ie. M = Id), then the 
sequence of subspaces would become identical with that of the Krylov subspaces and then 
both methods, Lanczos and Davidson, would theoretically be equivalent. However, since the 
considered orthonormal basis is not the same, the computation will remain heavier with the 
Davidson method. 

When one seeks several eigenvalues or when one knows that the desired eigenvalue is very 
close from some others, a block version of the algorithm can be used : several eigenpairs of 
Hj are computed at the same time and then several vectors are added to the basis Vj. 

2.3 Complexity of the Algorithms 

We now consider the complexity of one outer iteration of the Davidson algorithm when only 
one vector is added at every inner step to the basis. The cost of each such outer loop is 
roughly, 


Gout ® 


k 

+ i n + 2j 3 + jn + ( j + l)n + p r + (2 j + l)n 


j=i 

k{n z + p r ) + 5/2k(k + 9/5 )n + k 2 {k + l) 2 
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where n, n x and p r stand respectively for the order of the matrix A , its number of non-zero 
entries and the complexity of the preconditioning step. When the Jacobi preconditioner is 
used p r = n and the average cost for the inner iteration is then 

Cinn « n x + p, + 5/2 (k + 9/5)n + k(k + l) 2 

This shows that the cost per step increases quadratically with the step number fc, and a 
reasonable upper limit for k would be k = y/n in which case the average cost per inner-step 
comes to 

Cinn «n 2 + p J + 7/2n 3/2 + 0(n). 

This number has to be compared to (n, + 5n) which is the complexity for each Lanczos 
step, without reorthogonalization. It becomes clear that, in order to be competitive, the 
method needs a preconditioner which will strongly reduces the number of steps to achieve 
convergence. 

For the block version of the Davidson algorithm, the complexity per outer step remains 
roughly unchanged, but of course the number of necessary steps for convergence increases 
with the number of eigenpairs sought. 

3 The problem of matrix by vector multiplications 

Since both methods use a matrix by vector multiplication at each step the procedure to 
perform this operation must be carefully implemented. The typical matrices dealt with, are 
sparse and often unstructured and as a result very poor performance may prevail if not 
enough care is devoted to optimizing this operation. We review here some of the possible 
options we have to improve the speed of this basic kernel by exploiting vectorization and 
parallelization. 

To parallelize the multiplication, it is easy to consider A as a sum of some elementary 
matrices A = A t , to compute in parallel (y t := A t x) t=linproc and then accumulate the 

partial results y := Vt- Because the vector x must be read by all the processors, it 

should remain in the global memory or be duplicated in the local memories. The usual way 
to perform the partitioning is to define blocks of consecutive rows of A in order to bypass 
the accumulation step. A special case consists of considering one row only per block and 
perfor ming all the scalar products in parallel. Workload would obviously be balanced when 
the number of non-zero entries per processor is roughly the same. 

It may be pointed out that in the block version of the Davidson method parallelism can 
also be achieved by performing independently several multiplications. We will not discuss 
this obvious additional possibility which can also be exploited in the block form of the 
Lanczos algorithm. 

We now would like to examine in detail the procedure on one processor which is assumed 
to be a vector processor. The first observation that has been made in this context is that 
this operation can be performed by diagonals when the matrix is regularly structured, i.e., 
when it consists of a few diagonals [12]. The matrix can be stored in a rectangular array 
DIAG{ 1 : N, 1 : NDIAG) and the offsets of these diagonals from the main diagonal may 


6 



be stored in a small integer array IOFF{ 1 : NDIAG). After initializing the vector y 
to zero, the main loop for computing y = Ax is expressed in FORTRAN 8-X as follows. 

DO 10 J-l, NDIAG 
JOFF - IOFF(J) 

Y(1:N) » Y(1:N) + DIAG(1 :N, J)*X(J0FF+1 : JOFF+N) 

10 CONTINUE 

Excellent megaflops rates can be reached on vector machines when the matrix is large enough. 

For general sparse matrices there has been several attempts to obtain similar perfor- 
mances by either generalizing the diagonal storage scheme [18,27] or by reordering the ma- 
trix so as to obtain a diagonal structure [1,22]. We will only discuss the first approach 
here. This approach is of interest only for matrices whose maximum number of nonzeros 
per row jmax (which is called the degree of the row) is small. One then stores the en- 
tries of the matrix in a real array COEFF( 1 : n, 1 : jmax) together with an integer array 
JCOEFF( 1 : n, 1 : jmax ) that stores the column numbers of each entry of COEFF. We 
refer to this as the ITPACK format. The above FORTRAN loop then becomes, 

DO 10 J-l, NDIAG 

Y(1:N) - Y(1:N) + C0EFF(1:N, J)*X(JC0EFF(1:N,J)) 

10 CONTINUE 

The main difference between this loop and the previous one is the presence of indirect 
addressing in the innermost computation. Note that if the degree of the rows varies substan- 
tially, then many zero elements must be stored unnecessarily, and this scheme may become 
inefficient. 

The above storage schemes are somewhat specialized to certain types of matrices. These 
can be very useful in many instances but their lack of generality is a serious limitation. 
Unfortunately, as is often the case, there is a conflict between generality and efficiency. 

On of the most general schemes for storing sparse matrices uses a real array A(1 : NNZ) 
which contains the nonzero elements of the matrix, stored row- wise, an integer array JA{ 1 : 
NNZ) which stores the column positions of the corresponding elements in the real array 
A, and finally a pointer integer array IA(1 : N + 1) the i-th element of which points to 
the be ginning in the arrays A and J A of the consecutive rows. This data structure is often 
referred to as the general sparse format, or the A, JA, I A format. With this storage scheme 
each component of the resulting vector y can be easily computed independently as the dot 
product of the t-th row of the matrix with the vector x. We can write this as 

DO 10 >1, N 

K1 - IA(I) 

K2 » IAU+D-1 

Y(I) » D0TPR0DUCT ( A(K1:K2) , X(JA(K1:K2)) ) 

10 CONTINUE 

The outer loop can be performed in parallel, as mentioned before. On a machine like the 
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2-D 

3-D 

Markov 

matrix order 

2765 

3096 

14917 

non-zero entries 

16238 

40606 

128773 

maximum degree 

9 

38 

17 

minimum degree 

3 

6 

1 

average degree 

6 

13 

9 


Table 1: Characteristics of three sparse matrices. 

Alliant FX-8, the synchronization of this outer loop is inexpensive and the performance of 
the above program can be excellent. 

The indirect addressing involved in the second vector in the dot product loop is handled 
by a special hardware instruction called a Gather operation. The vector X(JA(kl : ifc2)) is 
first gathered from memory into a vector of contiguous elements. The dot product is then 
carried out as regular dot product operation. The first vector machines that appeared did 
not perform too well on sparse computations because they were not equipped with special 
instructions for Gather and Scatter. The beneficial impact of Hardware Scatter and Gather 
on vector machines has been discussed in [14]. 

For vector machines the previous two techniques are likely to perform very poorly because 
they involve vectors that are usually very short. For instance on CRAY-2, the sparse dot- 
product reaches half of the asymptotic speed with vectors of length 150 while the average 
degree per row of sparse matrices usually lies in the range 5-50. An alternative is to use one 
of the schemes based on diagonal and generalized banded format described above. However, 
the following scheme is more general. 

We start from the A,JA,IA data structure and build a new one by constructing what 
we call jagged diagonals [24]. This scheme is related to the stripe structure [16] or to the 
generalized-column wise storage [8]. We store as a dense vector the leftmost element from 
each row, together with two integer vectors containing the row and column positions of each 
element. This is followed by the second jagged diagonal consisting of the elements in second 
position from the left. This storage may also be done starting from the main diagonal, 
especially when storing only the upper- triangular part of the symmetric matrix. As we 
build more and more jagged diagonals, their length decreases. The number of j-diagonals is 
equal to the largest degree of the rows. As an illustration, three sparse matrices have been 
considered for the efficiency of this storage : two matrices arising from the triangularization 
of 2-D and 3-D domains and a stochastic transition probability matrix of a Markov chain; 
their characteristics are described in Table 1 and the lengths of their jagged diagonals are 
pictured in Figure 1. These examples show that, with this storage, most of the non-zero 
entries belong to long vectors. 

If IDIAG(j) is the pointer to the beginning of the j-th jagged diagonal and IROW(k) and 
JDIAG(k) are, respectively, the row position and the column position of the element stored 
in A(k), then the product y = Ax, can be computed as follows, 
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1024 

256 

64 

16 
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Markov 


Figure 1: Lengths of jagged diagonals for three test matrices. 


DO 10 J » 1, NDIAG 
K1 - IDIAG(J) 

K2 » IDIAG(J+1)-1 

Y(IR0W(K1:K2)) ■ Y(IR0W(K1:K2)) ♦ A(K1:K2)*X(JDIAG(K1:K2)) 

10 CONTINUE 

The asymptotic speed which can be obtained on CRAY 2 is 29 MFLOPS. To increase 
speed, one may proceed in two ways. A first way consists of reordering the rows by decreasing 
degree in order to be able to pick the non-zero entries of each jagged diagonal of A from 
consecutive rows; this allows us to avoid the indirect load and store on Y. On CRAY 2, the 
asymptotic speed is then 39 MFLOPS. The second way to increase speed is to unroll the 
loop by blocking several j-diagonals together. Let us assume, that now the sparse matrix is 
stored in a two-dimensional array A such that A(*,k) represents all the entries of the k-th 
jagged diagonals of all the blocks ; NS is the number of diagonals in a block ; IROW(k) and 
JDIAG(k,j) are the row and column indices of the entry stored in A(k,j). This definition 
implies that all the diagonals of a block have the same structure ; it can be achieved by 
adding some virtual non-zero entries. Let NB be the number of blocks and IDIAG(j) the 
pointer to the j-th block. The program becomes 
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r “ METHOD | 


1 

2 

3 

4 

5 

20 x 20 

6.56 

0.19 

8.04 

9.43 

7.11 

20 x 20 x 10 

7.06 

0.21 

7.013 

10.13 

4.05 

30 x30 

oa 

0.19 

8.98 

10.83 

8.19 

30 x 30 x 10 

5.42 

0.21 

6.48 

9.69 

3.32 


Table 2: Megaflop rates for five matrix by vector multiplication kernels on an Alliant FX-80 
(double precision arithmetic). 


DO 10 >1, NB 
K1 » IDIAG(J) 

K2 - IDIAG(J+1)-1 
Y(IR0W(K1:K2)) - Y(IR0W(K1:K2)) 

♦ A(K1:K2,1)*X(JDIAG(K1 :K2,1)) 

* + A(K1:K2,2)*X(JDIAG(K1 :K2,2)) 

* ♦ ... 

* + A(K1 :K2,NS)*X(JDIAG(K1 :K2,NS) ) 

10 CONTINUE 

The parameter NS is called depth of unrolling. The improvement achieved here is due to 
the fact that there is only one indirect load and store of the vector Y for NS indirect loads 
of the vector X. On CRAY 2, the asymptotic speed for NS = 4 is about 55 MFLOPS. 

It is obviously possible to combine both improvements (reordering rows and blocking 
diagonals) to reach higher rates of execution. 

As an illustration we show in the next table the performance, on an Alliant FX-80, of 
the following five different ways of multiplying a matrix by a vector: 

1. Row-wise storage , (sparse dot product form); 

2. Column wise storage, (sparse saxpy form); 

3. I tpack format 

4. Diagonal storage, (triad form); 

5. Jagged diagonal format(with reordered rows); 

The technique of unrolling has not yet been tested on Alliant. We took 5-point and 7-point 
matrices for 2-D and 3-D rectangular grids. Results are displayed in Table 2. Notice the 
wide differences in performance obtained between the various ways of performing the same 
operation. On the Alliant FX-80, method 2, using the column-wise storage is the worst 
performer. Also of interest, and to some extent disturbing, is the variation in performance 
obtained for different matrices with the same kernel. These discrepancies are especially 
noticeable in Kernel 5, using the Jagged diagonal format. 
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One problem that seems to have not yet been studied in the literature is that of performing 
simple operations with general sparse matrices on SIMD machines of the type of the MPP 
or the Connection machine. On such machines much of what has been accomplished is 
to test the usual symmetric conjugate gradient method for easy model problems [3,2], The 
difficulty with the more realistic general sparse problems is the apparent necessity of resorting 
to indirect addressing, a difficult operation on these architectures. Hammond and Law [9] 
propose a hardware solution based on systolic arrays. This challenging problem is to be 
solved to satisfaction before SIMD machines are to be considered real contenders to MIMD 
ones, in the race for usable supercomputers. 

4 Implementations and results on vector processors 

4.1 Implementation of the Lanczos method 

The questions that must be addressed when implementing a Lanczos type algorithm are 
numerous: 

• Should reorthogonalization be used and which form of reorthogonalization? 

• How often should one compute the approximate eigenvalues and how does one monitor 
convergence. 

• What is the best way of computing or updating the eigenvalues of T m ? 

We will now answer some of these questions. The trade-offs between reorthogonalizing and 
not reorthogonalizing in a Lanczos code are similar to those on a scalar machine, namely cost 
versus ease of implementations and simplicity. There are circumstances where reorthogonal- 
ization of some sort is essential. For example if the Lanczos algorithm is used in conjunction 
with shift-and-invert then the price of reorthogonalization is worth paying. The reason is 
that the major cost in this case is the factorization and there will be more factorizations if 
Lanczos is slower to converge as is the case when reorthogonalization is skipped. Relatively 
speaking, factorizations of unstructured sparse matrices are even more expensive on vector 
machines than on scalar machines so the case for even a full reorthogonalization in shift and 
invert is stronger. If a large number of eigenvalues must be computed, then reorthogonal- 
ization is an option that might become too expensive. 

We are interested in the problem of computing a few of the largest or smallest eigenvalues 
of A. We chose to implement an algorithm without reorthogonalization only because of the 
interesting challenges that the implementation of this technique brings to vector and paral- 
lel processing. When full reorthogonalization is used one has only to solve the problems of 
implementing a good Gram-Schmidt algorithm and a good matrix by vector multiplication 
routine. The question of tracking convergence and analyze T do not matter as much. The 
other forms of reorthogonalization, such as Selective Orthogonalization or Partial Reorthog- 
onalization [21] are in fact similar in nature to those of Lanczos without reorthogonalization 
but more complex to implement. 
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Here are a few points of detail on SLAN a simple Lanczos code without reorthogonaliza- 
tion, which we implemented and tested. 

• No reorthogonalization of any form is used. 

• The algorithm computes only the k leftmost or k rightmost eigenvalues of A. 

• Eigenvectors associated with these eigenvalues may be computed if desired. If so, these 
are accumulated from recomputed Lanczos vectors at the end of the process. 

• The code eliminates any new eigenvalue that is very close to an already computed 
eigenvalue. 

• Convergence is tracked from left to right for the smallest eigenvalues and from right to 
left for the largest ones. As a result, we only extract the eigenvalues of the tridiagonal 
matrix that are likely to converge next. The number of eigenvalues to extract is 
estimated from the number of desired eigenvalues and the rates of convergence. 


4.2 Analyzing T on vector and parallel processors 

Although computing the eigenvalues of a tridiagonal matrix in the Lanczos algorithm may 
seem negligible on a scalar machine, it is no longer the case in a supercomputing environment. 
The standard algorithm used for computing eigenvalues of tridiagonal matrices is the QL 
algorithm. In the context of the Lanczos algorithm it is more natural to use bisection type 
algorithm for several reasons [20]. In order to exploit vectorization we will use a form of 
multisectionning similar to that proposed by Lo and Philippe [15]. As in [15], we start by a 
first phase of isolation, where we seek an interval for each eigenvalue that does not contain 
any other eigenvalue. This process is sequential and consists of sweeping the data points 
from left to right and to determine which of the points is the closest to a given eigenvalue 
from the left and then from the right. Once a separate interval has been found for each 
eigenvalue, we can proceed to the next phase which consists of a refinement phase. Let us 
assume that we have a set of intervals [aj, bj] each of which contains exactly one eigenvalue 
and let Xj = (a, + bj )/ 2 be their middles. To be able to refine a given interval, we may use 
the usual Sturm sequence property and compute the (divided) Sturm sequences, 




ft 

<7,-i (*;) 


( 3 ) 


starting from < t 0 ( x ,) = 1. As is well-known, the number v(ij) of negative signs in the 
sequence < 7 j(xj), i = 1, 2, .., m counts the number of eigenvalues of T m that are located at the 
left of xj. Therefore, the left interval bound aj will be moved to the middle Xj, whenever 
v{xj) is less than j otherwise bj is moved to Xj. Note that the operations to compute the 
sequence <r t (x_,) is sequential with respect to the subscript i, but it vectorizes across the 
points Xj. 

After a certain number of additional bisections steps we come close enough to the solution 
that we can use a modified Newton iteration. We call this the third phase of the algorithm. 
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The Secant method is used instead of the traditional Newton. The iteration is applied to the 
rational function <r m (t ) defined above which, as is well-known, is nothing but the ratio of the 
characteristic polynomial of T m over that of T m _ i. For any value t that is not an eigenvalue 
of the intermediate matrices T,, * = 1,2, ...,m — 1, the function <r m (t) can be evaluated with 
the help of the recursion (3). If we call xj the current approximation of the eigenvalue in the 
interval [aj, 6,-], then we start by computing every cr m (xj) from the above recurrence, and 
then we define the new iterate by a secant step, namely, 


Xj — Xj CT m (Xj) 


b : ~ °j 

*m(bj) - er m (aj) 


( 4 ) 


The algorithm for computing the eigenvalues A mi , . . . , A m2 , can be sketched as follows. 


Vectorized Multisectioning Algorithm (VMSEC) 

A. Start: 

• For j = mi, . . . , m 2 , define aj — a, bj = 6, with a, 6 = Gershgorin lower and upper 
bounds. 

• Define x, = a +j(6— a)/(m 2 — mi + l),j = mi,..., m 2 

• Define Phase =1. 

B. Sturm: For k = 1, ..., m Do 

Compute (Vector mode) cr(xj) and v(xj) for j = mi, . . . , m 2 . 

C. Phase 1: If Phase=l then, 

1. Start x* = x m , 

2. For i = mi . . . m 2 Do: 

• Starting from previous x, determine the smallest Xj that is larger than A,-. 
Call this value x. and denote the old x, by x„_i. 

• If bj > x. then bj := x„ 

• If aj < x*_i then a< = x,_i. 

3. Test for Phase 2: If i/(bj) = i and i/(a,) = * — 1 for all t = m x , ..., m 2 then Phase=2. 

4. If Phase=l then define new set of x' s by putting in each interval [ay, bj] a number 
of v(bj) — i '(aj) equally distributed points. 

D. Phase 2: If (Phase = 2 ) then 

1. For j = mi,m 2 Do 

\i u{xj) > j then bj = Xj else aj — Xj 

2. For j = mi, . . . , m 2 Do Xj — (a, + bj)/ 2. 

3. Test for Phase 3: if b(j ) — a(j) < tol * [aj+i — 6y_i], j = mi, ..., m 2 Phase = 3. 

E. Phase 3: If (Phase = 3) Compute new sequence xy, j = mi, . . . , m 2 according to (4). 

F. Convergence test: If all eigenvalues have converged then exit else goto B. 
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mi, m 2 

| VMSEC 

TQLl 

TRIDIB 

PHASE 1 

PHASE 2 

PHASE 3 

TOTAL 

MFLOPS 

1 , 10 

7.1 

17 

12 

26.8 

12 

- 

213 

1 , 50 

12.3 

44.5 

2.7 

59.5 

26 

- 

1,100 

1 , 100 

24.2 

87.1 

5.4 

117 

27 

- 

2,200 

1 , 500 

93.8 

440 

27.4 

550 

29 

600 

10,700 


Table 3: Times for the different phases in Algorithm VMSEC on a Cray 2 and for the 
corresponding EISPACK routines (All times in milliseconds). 

Note that the only part that does not vectorize in the above algorithm is Phase 1. One 
might ask what percentage of the total run time this phase will usually take. In Table 3 we 
show a break-down of the total time with respect to the three phases. The matrix is the 
classical Tridiag[— 1,2, —1] of size n — 500. We consider four different cases for the values 
of mi and m 2 . 

As can be seen from Table 3, Phase 1 is by no means dominating even if one computes a 
large number of eigenvalues. In fact the number of Phase 1 iterations required is usually less 
than 3 or 4. Then the algorithm proceeds to the most time consuming part which is Phase 
2. Phase 3 also takes a small number of steps, usually less than 5. 

An interesting comparison to make is with standard routines from EISPACK. We com- 
pared VMSEC with both Tridib and TQLl which are the two competitive routines from the 
scientific library. Table 3 gives also the times for the three methods for the same matrix 
and the same machine. We asked Tridib and VMSEC to compute exactly the same eigenval- 
ues and gave the same initial intervals to both routines, namely the interval [0, 4] given by 
Gershgorin’s theorem for the test matrix considered. Note that even for computing all the 
eigenvalues of a large matrix VMSEC is now competitive with TQLl, the method of choice in 
this case. We should point out that if we do not need to compute the eigenvectors, as is the 
case in ’Analyze T’, then Cuppen’s divide and conquer algorithm [5,7] is not competitive. A 
remarkable observation is that when computing all the eigenvalues the two algorithms TQLl 
and VMSEC are very close even for larger matrices. For computing a smaller number of 
eigenvalues VMSEC is invariably superior. 

Also of importance in the Lanczos algorithm is the computation of residual norms (A — 
which as is well-known [21] can be readily obtained from the last component of 
the eigenvector 

||(A-AS m) /)u! m) ||2 = ^ +1 |e^S m) | (5) 

To compute inexpensively the last component of y\ m \ we exploit the observation made in 
[20] that this component is equal to the last term of the sequence < 7 fc(Aj m ^), which is available 
for free from the Sturm sequence computation. 
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4.3 Implementation of Davidson’s method 

The routine developed computes the lowest eigenvalues of a symmetric sparse matrix A and 
their associated vectors. The algorithm is expressed in a block version using the Jacobi 
preconditionner ; it is organized as follows 

Initializations : initial guess for the first block ; 

Outer loop : 

Inner loop : 

• CALL MATMULT (multiplication of the last block by A) ; 

• CALL MVTW (update of the interaction matrix H) ; 

• Computation of the NB lowest eigenvalues and of their associated vectors by 
EISPACK (TRED1 - TRIDIB - TINVIT - TRBAK1) ; 

• CALL RITZ (computation of the Ritz vectors, which are candidate for a new 
block, and of their residuals) ; 

• Test for convergence ; 

• Test for ending the inner loop (too many vectors in the basis) ; 

• CALL CORRECT (preconditioning every vector of the new block) 

• CALL COMPL (projection of the new block onto the orthogonal complement 
of the subspace already spanned since the beginning of the outer iteration) ; 

• CALL GRAMS (Modified Gram-Schmidt orthogonalization on the new block; 
only are kept the vectors having a significant contribution) ; 

CALL GRAMS (On the last Ritz vectors which will be used to start a new outer 
iteration) ; 


The following example illusrates a typical run. The test matrix has been randomly 
built by defining its sparsity (« 0.01), and by choosing the diagonal entries in the range 
[—a,0] and the non-zero off-diagonal entries in the range [—/?,/?], where a = 10 and /? = 1. 
The sparsity pattern is unstructured. In Table 4, the relative run times of the different 
parts of the computation are reported with corresponding MFLOPS rates. Convergence was 
obtained during the 4-th outer iteration. The computation involved 316 sparse matrix x 
vector multiplications. The subroutine which performs these multiplications (MATMULT) 
has to be provided by the user. Here only the upper-part of the matrix has been stored 
row-wise and the multiplication is performed using the SPAXPY and SPDOT routines. By 
considering full storage, reordering the rows in decreasing degree and by storing blocks of 
two jagged diagonals, the sparse multiplication reaches about 27 MFLOPS; in this situation, 
the corresponding part of the computation drops to 6 % of the overall process. 

When calling the routine DAVID, the user defines the block size and the maximum size 
for the basis. To minimize the number of iterations, he must choose the smallest block 
size, i.e. the number of sought eigenpairs. However, if he knows that there are some close 
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computation 

% of running time 

MFLOPS 

EISPACK 

40 % 

— 

MATMULT 

36 % 

3 

COMPL 

14 % 

37 1 

RITZ 

4% 

130 

MVTW 

4% 

130 

Remaining 

2 % 

— 


Table 4: Percentages of times for the different phases in the Davidson Algorithm on a CRAY 
2 (1 processor) 

eigenvalues, he may define the block size so that the Ritz vectors corresponding to these 
eigenvalues are computed. For the maximal size of the basis, the user is first limited by the 
memory capacity. He has also to maintain the orthogonality of the basis ; for that purpose, 
there is a control on the new vectors which are incorporated to the basis : those which 
are, before projection onto the complement of the basis, almost spanned by them, are lost. 
When there is some loss of orthogonality of the basis, the residuals stop to decrease and the 
Ritz vectors are a worse initial guess for the next outer iteration than they were some inner 
iterations before. 

To conclude, we may assert that the algorithm is well suited for vector processors since the 
main part of the computation is expressed by vector operations or even by matrix operations. 
We have mainly tackled the important problem of sparse matrix by vector multiplication. 
It remains to optimize the solution of the eigenvalue problem that is solved at every step. 
In the current implementation, the routines of tridiagonalization and back transformation 
(TREDl and TRBAKl) are well vectorized. In contrast TRIDIB and TINVIT are sequential 
; the first one may be replaced by the algorithm VMSEC but a vector version of TINVIT has 
yet to be designed. Further improvements might then be achieved by using at one step the 
previous estimates of the eigenvalues and by computing the eigenvectors by inverse iterations 
from the untransformed interaction matrix. 

4.4 Comparing the two methods 

In this section, we discuss the domain of applicability of Davidson’s method vs. the Lanczos 
algorithm. Since we restrict our study to the Jacobi preconditioner, we are interested at 
the effect of diagonal dominance of the matrix on execution times. Here, the term diagonal 
dominant matrix is loosely used to refer to a matrix whose diagonal entries vary substantially 
relatively to the other elements of the matrix. Table 5 reports run times, on a CRAY X-MP, 
for both methods when looking for the lowest five eigenvalues and corresponding eigenvectors 
of a randomly generated symmetric sparse matrix. Tests with our codes have been carried out 
by P. Harten [10]. The matrix is of order 1000 and has a full diagonal whose elements are first 
taken as random numbers between 0 and 1, and then multiplied by a diagonal scaling factor; 
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diagonal factor 

Davidson 

Lanczos 

10 

6.757 

1.455 

20 

3.969 

1.547 

30 

3.411 

1.841 

40 

2.699 

1.950 

| 50 

2.380 

2.692 

60 

2.322 

2.140 

70 

2.287 

2.396 

80 

2.168 

2.751 

90 

2.196 

7.804 

100 

1.996 

6.705 

110 

1.993 

7.099 


Table 5: Davidson - Lanczos run time comparisons (seconds) on a CRAY X-MP (1 processor). 

the off-diagonal elements are random numbers between -1 and +1; the positions of these 
non-zero off-diagonal elements are randomly selected as to give a specified density, namely 

0.01. The results clearly show that the methods have an opposite behavior with respect to 
diagonal dominance. However, it may be asserted that the Lanczos algorithm has a broader 
spectrum of applicability than Davidson’s method since in case of no diagonal dominance, 
the second method may fail. Davidson’s method is efficient in specific applications such as 
in Quantum Chemistry. 


5 Towards Parallel Implementations 

Parallelism can easily be exploited in almost all the steps of both methods: matrix or vector 
operations are well adapted since the length of the involved vectors is large and multiplying 
by the sparse matrix can be efficiently performed as already seen. Only two bottlenecks 
occur, namely 

1. the eigenvalue problem for the reduced matrix (interaction matrix), 

2. the orthogonalization process. 

The eigenvalue problem in (1) is different for the two algorithms. In the Lanczos process, 
the matrix under consideration can be very large (especially if only eigenvalues are required) 
but it is tridiagonal. In contrast, for Davidson’s method the interaction matrix is full and 
usually of smaller order; as is usually done, it must first be put into tridiagonal form. To 
parallelize the calculation of the eigenvalues, it is possible to split their range into several 
pieces which are distributed among all the processors and then apply algorithm VMSEC 
independently on them. However, if the number of sought eigenvalues is small the algorithm 
in [15] which relies only on parallelism and not of vectorization, can be more effective. 
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Parallelizing the computation of the eigenvectors is theoretically much easier since the inverse 
iteration can be done in parallel for every wanted vector. However, if eigenvalues are clustered 
then some attention must be paid to maintaining orthogonality. This was addressed in [15]. 

In these algorithms orthogonalization is performed by the Modified Gram-Schmidt pro- 
cess (MGS): 


do k = 1 , m 

normalize v * 
do j = k+1, m 

Vj := Vj - 

where the vectors to be orthogonalized are u *, k = l,...,m. Obviously, the inner loop can 
be run in parallel. To increase the efficiency, it is possible to add synchronizing barriers in 
order to insure that as soon one vector has received all the corrections from the previous 
vectors, it is normalized and is then ready to correct the next vectors. On a CRAY X-MP/48 
(4 processors) this modification improves the speed-up of the process from 2.2 to 3.2 [15]. 
Another algorithm consists of replacing in (MGS) the vectors by blocks of vectors; then the 
normalization of one vector is replaced by applying (MGS) to the block. This algorithm does 
not have equivalent stability properties but carefully used it can be very efficient, especially 
to exploit data locality in hierarchical memory systems [11]. 

The parallelization of the methods does not seem to be too difficult to carry out at least 
on a moderately large number of processors. Our first tests in this direction indicate that 
by using CRAY microtasking techniques fairly good speed-ups can be expected. 
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